Owner: Daniel Soukup - Created: 2025.11.01
In this final recipe, we run an in-depth evaluation of our best-performing model's predictions using a number of classification metrics and visuals. Note that model selection was technically handled by the Modeling notebook and the automated hyperparameter tuning flow. Here, our focus is to understand model accuracy both in the aggregate and in its specific shortcomings, to motivate future model (and data processing) iterations.
NOTE: the markdown in this notebook refers to our latest run; subsequent re-training/re-evaluation would likely change the exact numbers in the output fields.
We load the prediction datasets for analysis.
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
# Read recipe inputs
predictions_learn = dataiku.Dataset("predictions_learn")
predictions_learn_df = predictions_learn.get_dataframe()
predictions_test = dataiku.Dataset("predictions_test")
predictions_test_df = predictions_test.get_dataframe()
predictions_learn_df.head()
| | income | pred | pred_proba |
|---|---|---|---|
| 0 | 0 | 0 | 0.001265 |
| 1 | 0 | 0 | 0.087387 |
| 2 | 0 | 0 | 0.000528 |
| 3 | 0 | 0 | 0.000169 |
| 4 | 0 | 0 | 0.000169 |
TARGET = 'income'
datasets = {
'train': predictions_learn_df,
'test': predictions_test_df
}
Let's look at high-level statistics of the predictions first.
for name, data in datasets.items():
    print(name, 'summary:')
    print(data.describe())
    print('\n\n')
train summary:
income pred pred_proba
count 152807.000000 152807.000000 152807.000000
mean 0.080625 0.036634 0.080685
std 0.272259 0.187863 0.159195
min 0.000000 0.000000 0.000096
25% 0.000000 0.000000 0.003254
50% 0.000000 0.000000 0.019552
75% 0.000000 0.000000 0.076145
max 1.000000 1.000000 0.997850
test summary:
income pred pred_proba
count 78826.000000 78826.00000 78826.000000
mean 0.078198 0.03542 0.078198
std 0.268484 0.18484 0.156972
min 0.000000 0.00000 0.000096
25% 0.000000 0.00000 0.002874
50% 0.000000 0.00000 0.018293
75% 0.000000 0.00000 0.073341
max 1.000000 1.00000 0.997846
We can already see that while the true labels contain ~8% high income, our predictions were positive only ~3.5% of the time. However, the predicted probabilities are fairly well calibrated, with their mean matching the label mean almost exactly. We will experiment with alternatives to the default 0.5 cutoff to find better precision/recall trade-offs later in the notebook.
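To go beyond comparing means, a quick reliability check bins the predicted probabilities and compares each bin's average prediction with the observed positive rate. A minimal sketch using scikit-learn's calibration_curve on toy arrays (in the notebook we would pass the income and pred_proba columns instead):

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Toy labels/probabilities standing in for the income / pred_proba columns
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.8, 0.9])

# For each probability bin: observed positive rate vs. mean predicted probability
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=2)
print(prob_true)  # observed positive rate per bin
print(prob_pred)  # mean predicted probability per bin
```

A well-calibrated model shows `prob_true` close to `prob_pred` in every bin, not just on average.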
import plotly.express as px
import plotly.offline as pyo
pyo.init_notebook_mode()
fig = datasets['train'].plot(kind='hist', backend='plotly', x='pred_proba', color='income', log_y=True, opacity=0.7)
fig.show()
We can see the probability distribution for the true 0/1 labels. For class 0, the probabilities are nicely concentrated near 0 as expected; however, we do see a fair number of class 1 samples with low predicted probabilities (i.e., instances misclassified by a large margin). Note the log scale on the y-axis.
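One way to quantify that "misclassified by a large margin" group is to count true positives that received a very low score. A sketch on a tiny stand-in frame with the same column names as our prediction datasets (the 0.1 cutoff is an arbitrary illustration):

```python
import pandas as pd

# Tiny stand-in for predictions_learn_df
df = pd.DataFrame({
    'income':     [1,    1,   0,    0,   1],
    'pred_proba': [0.05, 0.9, 0.02, 0.3, 0.08],
})

# True positives the model scored below 0.1: wrong by a wide margin
hard_misses = df[(df['income'] == 1) & (df['pred_proba'] < 0.1)]
print(len(hard_misses), 'of', int((df['income'] == 1).sum()), 'positives scored below 0.1')
```

On the real data, `hard_misses` would be the segment to profile in the future-work analysis mentioned above.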
As future work, we can explore the segment of data where these misclassifications occurred to better understand how to address the issue (e.g., are certain groups over represented under the misclassified samples).
We will look at the standard binary classification metrics using scikit-learn's classification_report:
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay
def get_classification_reports(datasets) -> None:
    for name, data in datasets.items():
        print(name, "\n\n")
        print(classification_report(data[TARGET], data['pred']))
        print("\n\n")
get_classification_reports(datasets)
train
precision recall f1-score support
0 0.95 0.99 0.97 140487
1 0.79 0.36 0.50 12320
accuracy 0.94 152807
macro avg 0.87 0.68 0.73 152807
weighted avg 0.93 0.94 0.93 152807
test
precision recall f1-score support
0 0.95 0.99 0.97 72662
1 0.77 0.35 0.48 6164
accuracy 0.94 78826
macro avg 0.86 0.67 0.72 78826
weighted avg 0.93 0.94 0.93 78826
Observations:
- Overall accuracy is high (0.94), but it is driven by the majority class: precision/recall for class 0 are 0.95/0.99.
- For the minority class 1, precision is reasonable (~0.77-0.79) but recall is low (~0.35-0.36): we miss roughly two-thirds of the high-income samples.
- Train and test numbers are very close, so the model does not appear to overfit.
The confusion matrices give another view of the correct and misclassified samples:
import matplotlib.pyplot as plt
ConfusionMatrixDisplay.from_predictions(predictions_learn_df['income'], predictions_learn_df['pred'])
plt.show()
ConfusionMatrixDisplay.from_predictions(predictions_test_df['income'], predictions_test_df['pred'])
plt.show()
The misclassified samples are off the diagonal:
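If we need the raw off-diagonal counts rather than the plot, they can be read directly from the matrix. A small sketch on toy labels (scikit-learn orders the binary matrix as [[TN, FP], [FN, TP]]):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels standing in for the income / pred columns
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0])

# ravel() flattens [[TN, FP], [FN, TP]] into the four counts
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print('false positives:', fp, 'false negatives:', fn)
```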
As mentioned previously, we can consider adjusting our hard predictions to achieve a better precision-recall trade-off. We can see this on the precision-recall curve; note that we used the area under this curve as the objective to optimize when training our XGBoost models.
from sklearn.metrics import PrecisionRecallDisplay
display = PrecisionRecallDisplay.from_predictions(
    datasets['test'][TARGET],
    datasets['test']['pred_proba'],
    name="XGBoost",
    plot_chance_level=True,
    despine=True,
)
plt.axvline(0.37, c="black")  # mark the recall/precision point we got from the evaluation
plt.show()
We could select an alternative threshold:
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(
    datasets['train'][TARGET],
    datasets['train']['pred_proba'],
)
# precision and recall carry one extra trailing entry (the appended 1/0 endpoint),
# so drop it before pairing the strided slices with thresholds
pd.DataFrame(
    {
        'precision': precision[:-1][::1500],
        'recall': recall[:-1][::1500],
        'threshold': thresholds[::1500]
    }
).set_index('threshold').T
| threshold | 0.000096 | 0.000742 | 0.001184 | 0.001573 | 0.002018 | 0.002539 | 0.003077 | 0.003713 | 0.004395 | 0.005165 | 0.005983 | 0.006852 | 0.007797 | 0.008765 | 0.009827 | 0.010938 | 0.012115 | 0.013358 | 0.014696 | 0.016135 | 0.017712 | 0.019354 | 0.021098 | 0.023076 | 0.025185 | 0.027408 | 0.029959 | 0.032654 | 0.035552 | 0.038696 | 0.042378 | 0.046164 | 0.050410 | 0.054904 | 0.059893 | 0.065424 | 0.071741 | 0.079085 | 0.088249 | 0.098399 | 0.111032 | 0.124720 | 0.142652 | 0.164242 | 0.192990 | 0.229438 | 0.279583 | 0.347864 | 0.433548 | 0.553797 | 0.708939 | 0.909283 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| precision | 0.080625 | 0.092372 | 0.094876 | 0.097671 | 0.101055 | 0.104009 | 0.106572 | 0.109636 | 0.112506 | 0.115399 | 0.118290 | 0.121056 | 0.124432 | 0.127263 | 0.130471 | 0.133450 | 0.136883 | 0.140821 | 0.144680 | 0.148474 | 0.152545 | 0.157253 | 0.162072 | 0.168002 | 0.173802 | 0.179265 | 0.185817 | 0.191868 | 0.198891 | 0.206476 | 0.214371 | 0.222911 | 0.231243 | 0.241310 | 0.251527 | 0.263229 | 0.275950 | 0.289841 | 0.305540 | 0.323176 | 0.344222 | 0.367406 | 0.396656 | 0.428661 | 0.474794 | 0.530475 | 0.589667 | 0.664892 | 0.745859 | 0.824104 | 0.910399 | 0.976208 |
| recall | 1.000000 | 1.000000 | 0.999838 | 0.999756 | 0.999513 | 0.999107 | 0.998701 | 0.998214 | 0.997890 | 0.996997 | 0.996104 | 0.995049 | 0.993994 | 0.992776 | 0.991964 | 0.990584 | 0.988961 | 0.988068 | 0.986607 | 0.984578 | 0.982630 | 0.979870 | 0.977192 | 0.973539 | 0.969724 | 0.965179 | 0.962419 | 0.957549 | 0.952435 | 0.948214 | 0.942695 | 0.936769 | 0.928896 | 0.921916 | 0.912338 | 0.902029 | 0.890341 | 0.877029 | 0.862256 | 0.842776 | 0.821997 | 0.791802 | 0.758766 | 0.724756 | 0.679627 | 0.626623 | 0.560471 | 0.486526 | 0.405682 | 0.328571 | 0.225974 | 0.106575 |
We can see that if we sacrifice some precision, recall can be brought up: e.g., with a ~0.2-0.3 threshold we can achieve nearly even precision and recall, around 0.5-0.6.
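Picking the threshold where precision and recall cross can be automated from the curve arrays. A sketch on toy scores (on the real data we would pass the train labels and pred_proba columns as above):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels/scores standing in for income / pred_proba
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.15, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# precision/recall have one trailing extra entry; drop it before pairing with thresholds
gap = np.abs(precision[:-1] - recall[:-1])
break_even = thresholds[np.argmin(gap)]
print('break-even threshold:', break_even)
```

The break-even point is only one possible criterion; maximizing F1 or fixing a minimum precision would follow the same pattern.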
Finally, let's look at the ROC-AUC scores and the ROC curve:
from sklearn.metrics import roc_auc_score, roc_curve, RocCurveDisplay
def get_roc_scores(datasets) -> None:
    for name, data in datasets.items():
        print(name)
        print(roc_auc_score(data[TARGET], data['pred_proba']))
        print("\n")
get_roc_scores(datasets)
train
0.9258060949439421

test
0.9251752458214115
In the plots below we show the TPR (true positive rate) against the FPR (false positive rate): ideally, we achieve high TPR at low FPR values.
RocCurveDisplay.from_predictions(predictions_test_df[TARGET], predictions_test_df["pred_proba"])
RocCurveDisplay.from_predictions(predictions_learn_df[TARGET], predictions_learn_df["pred_proba"])
plt.show()
That is, ideally the curve hugs the top-left corner. But again, this metric is not sensitive enough for imbalanced datasets.
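Average precision (the area under the precision-recall curve) is a more informative summary when positives are rare. A deterministic toy example (not our model's output) where ROC-AUC looks strong while average precision reveals the cost of one highly-scored negative:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# 8 negatives, 2 positives; one negative outscores every positive
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.95, 0.8, 0.9])

roc = roc_auc_score(y_true, y_score)           # 14 of 16 pos/neg pairs ranked correctly
ap = average_precision_score(y_true, y_score)  # penalized by the top-ranked false positive
print('ROC-AUC:', roc)
print('average precision:', ap)
```

This is why we report the classification report, PR curve, and thresholds above alongside ROC-AUC.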
To better understand model performance, future iterations could drill into the segments where the misclassifications occur and tune the decision threshold, as discussed above.